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Abstract — We consider the data-driven dictionary learning 
problem. The goal is to seek an over-complete dictionary from 
which every training signal can be best approximated by a 
linear combination of only a few codewords. This task is often 
achieved by iteratively executing two operations: sparse coding 
and dictionary update. In the literature, there are two benchmark 
mechanisms to update a dictionary. The first approach, such as 
the MOD algorithm, is characterized by searching for the optimal 
codewords while fixing the sparse coefficients. In the second 
approach, represented by the K-SVD method, one codeword 
and the related sparse coefficients are simultaneously updated 
while all other codewords and coefficients remain unchanged. We 
propose a novel framework that generalizes the aforementioned 
two methods. The unique feature of our approach is that one 
can update an arbitrary set of codewords and the corresponding 
sparse coefficients simultaneously: when sparse coefficients are 
fixed, the underlying optimization problem is similar to that in the 
MOD algorithm; when only one codeword is selected for update, 
it can be proved that the proposed algorithm is equivalent to the 
K-SVD method; and more importantly, our method allows us to 
update all codewords and all sparse coefficients simultaneously, 
hence the term simultaneous codeword optimization (SimCO). 
Under the proposed framework, we design two algorithms, 
namely, primitive and regularized SimCO. We implement these 
two algorithms based on a simple gradient descent mechanism. 
Simulations are provided to demonstrate the performance of the 
proposed algorithms, as compared with two baseline algorithms 
MOD and K-SVD. Results show that regularized SimCO is 
particularly appealing in terms of both learning performance 
and running speed. 
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I. Introduction 

Sparse signal representations have recently received exten- 
sive research interests across several communities including 
signal processing, information theory, and optimization |1|, 
10 j llll- The basic assumption underlying this technique 
is that a natural signal can be approximated by the combination 
of only a small number of elementary components, called 
codewords or atoms, that are chosen from a dictionary (i.e., the 
whole collection of all the codewords). Sparse representations 
have found successful applications in data interpretation ||5l, 
||6l, source separation [TJ, tSl, ||9|, signal denoising ifTOll . 
ifTTl . coding 111, lHa, lEl, classification HSl, Hg), El, 
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recognition ifTSl . impainting |fT9ll , ll20ll and many more (see 
e.g. I21j)- 

Two related problems have been studied either separately or 
jointly in sparse representations. The first one is sparse coding, 
that is, to find the sparse linear decompositions of a signal 
for a given dictionary. Efforts dedicated to this problem have 
resulted in the creation of a number of algorithms including ba- 
sis pursuit (BP) II22II . matching pursuit (MP) |23|, orthogonal 
matching pursuit (OMP) 1241, subspace pursuit (SP) ll26l . 
[27], regression shrinkage and selection (LASSO) f28l, focal 
under-determined system solver (FOCUS S) |29|, and gradient 
pursuit (GP) 1301 . Sparse decompositions of a signal, however, 
rely highly on the degree of fitting between the data and the 
dictionary, which leads to the second problem, i.e. the issue 
of dictionary design. 

An over-complete dictionary, one in which the number of 
codewords is greater than the dimension of the signal, can be 
obtained by either an analytical or a learning-based approach. 
The analytical approach generates the dictionary based on a 
predefined mathematical transform, such as discrete Fourier 
transform (DFT), discrete cosine transform (DCT), wavelets 
|31|, curvelets [32l, contourlets f33l, and bandelets fM\. Such 
dictionaries are relatively easier to obtain and more suitable 
for generic signals. In learning-based approaches, however, the 
dictionaries are adapted from a set of training data |5|, f35l, 
Ei, EH, ED, Ho), mni, m], m. Although tWs may 
involve higher computational complexity, learned dictionaries 
have the potential to offer improved performance as compared 
with predefined dictionaries, since the atoms are derived to 
capture the salient information directly from the signals. 

Dictionary learning algorithms are often established on 
an optimization process involving the iteration between two 
stages: sparse approximation and dictionary update. First an 
initial dictionary is given and a signal is decomposed as 
a linear combination of only a few atoms from the initial 
dictionary. Then the atoms of the dictionary are trained with 
fixed or sometimes unfixed weighting coefficients. After that, 
the trained dictionary is used to compute the new weighting 
coefficients. The process is iterated until the most suitable 
dictionary is eventually obtained. 

One of the early algorithms that adopted such a two-step 
structure was proposed by Olshausen and Field IQ, If35l . 
where a maximum likelihood (ML) learning method was used 
to sparsely code the natural images upon a redundant dictio- 
nary. The sparse approximation step in the ML algorithm Q 
which involves probabilistic inference is computationally ex- 
pensive. In a similar probabilistic framework, Kreutz-Delgado 
et al. 1371 proposed a maximum a posteriori (MAP) dictionary 
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learning algorithm, where the maximization of the likelihood 
function as used in f5"| is replaced by the maximization of 
posterior probability that a given signal can be synthesized by 
a dictionary and the sparse coefficients. Based on the same 
ML objective function as in |5|, Engan et al. fiE\ developed 
a more efficient algorithm, called the method of optimal 
directions (MOD), in which a closed-form solution for the 
dictionary update has been proposed. This method is one of the 
earliest methods that implements the concept of sparification 
process ll43l . Several variants of this algorithm, such as the 
iterative least squares (ILS) method, have also been developed 
which were summarized in [44J. A recursive least squares 
(RLS) dictionary learning algorithm was recently presented 
in P3]| where the dictionary is continuously updated as each 
training vector is being processed, which is different from the 
ILS dictionary learning method. Aharon, Elad and Bruckstein 
developed the K-SVD algorithm in |TOl by generalizing the 
K-means algorithm for dictionary learning. This algorithm 
uses a similar block-relaxation approach to MOD, but updates 
the dictionary on an atom-by-atom basis, without having to 
compute matrix inversion as required in the original MOD 
algorithm. The majorization method was proposed by ll46l 
in which the original objective function is substituted by a 
surrogate function in each step of the optimization process. 

In contrast to the generic dictionaries described above, 
learning structure-oriented parametric dictionaries has also 
attracted attention. For example, a Gammatone generating 
function has been used by Yaghoobi et al. WT\ to learn 
dictionaries from audio data. In |48|, a pyramidal wavelet- 
like transform was proposed to learn a multiscale structure 
in the dictionary. Other constraints have also been considered 
in the learning process to favor the desired structures of the 
dictionaries, such as the translation-invariant or shift-invariant 
characteristics of the atoms imposed in ll49l , ll50l . lISTl . Il52l . 
||5?I and the orthogonality between subspaces enforced in ll54ll . 
and the de-correlation between the atoms promoted in f 35]| . An 
advantage of a parametric dictionary lies in its potential for 
reducing the number of free parameters and thereby leading 
to a more efficient implementation and better convergence of 
dictionary learning algorithms ll43l . Other recent efforts in 
dictionary learning include the search for robust and compu- 
tationally efficient algorithms, such as 1561 . ifSTll . and ifTTIl . 
and learning dictionaries from multimodal data ll58l . ll59ll . 
Comprehensive reviews of dictionary learning algorithms can 
be found in recent survey papers e.g. |43| and |60|. 

In this paper, similar to MOD and K-SVD methods, we 
focus on the dictionary update step for generic dictionary 
learning. We propose a novel optimization framework where 
the dictionary update problem is formulated as an optimization 
problem on manifolds. The proposed optimization framework 
has the following advantages. 

• In our framework, an arbitrary subset of the codewords 
are allowed to be updated simultaneously, hence the 
term simultaneous codeword optimization (SimCO). This 
framework can be viewed as a generalization of the MOD 
and K-SVD methods: when sparse coefficients are fixed, 
the underlying optimization problem is similar to that in 
the MOD algorithm; when only one codeword is selected 



for update, the optimization problems that arise in both 
SimCO and K-SVD are identical. 

• Our framework naturally accommodates a regularization 
term, motivated by the ill-condition problem that arises in 
MOD, K-SVD and primitive SimCO (detailed in Section 
rvT l. We refer to SimCO with the regularization term 
as regularized SimCO, which mitigates the ill-condition 
problem and hence achieves much better performance 
according to our numerical simulations. Note however 
that it is not straightforward to extend MOD or K-SVD 
to the regularized case. 

« Though our implementation is based on a simple gradient 
descent mechanism, our empirical tests show that the 
regularized SimCO that updates all codewords simultane- 
ously enjoys good learning performance and fast running 
speed. 

Furthermore, we rigorously show that when only one code- 
word is updated in each step, the primitive SimCO and K- 
SVD share the same learning performance with probability 
one. As a byproduct, for the first time, we prove that a gradient 
search on the Grassmann manifold solves the rank-one matrix 
approximation problem with probability one. 

The remainder of the paper is organized as follows. Section 
nil introduces the proposed optimization formulation for dic- 
tionary update. Section |lll] provides necessary preliminaries 
on manifolds and shows that dictionary update can be cast as 
an optimization problem on manifolds. The implementation 
details for primitive and regularized SimCOs are presented in 
Sections |IV] and [V) respectively. In Section IVII we rigorously 
prove the close connection between SimCO and K-SVD. Nu- 
merical results of SimCO algorithms are presented in Section 
IVIII Finally, the paper is concluded in Section IVIIII 

II. The Optimization Framework of SimCO 

Dictionary learning is a process of which the purpose is 
to find an over-complete dictionary that best represents the 
training signals. More precisely, let Y E jjmxn jj^g 
training data, where each column of Y corresponds to one 
training sample. For a given dictionary size d S Z"*", the 
optimal dictionary D* G M™^"^ is the one that corresponds to 
inf^gR^xd^ xsK^ix" 11"^ - DX\\p, where ||-||^ is the Frobe- 
nius norm. Here, the i*^ column of D is often referred to as 
the i*'* codeword in the dictionary. In practice, it is typical that 
m < d < n, i.e., an over-complete dictionary is considered 
and the number of training samples is larger than the number 
of codewords. Generally speaking, the optimization problem is 
ill-posed unless extra constraints are imposed on the dictionary 
D and the coefficient matrix X. The most common constraint 
on X is that X is sparse, i.e., the number of nonzero entries 
in X, compared with the total number of entries, is small. 

Most dictionary learning algorithms consist of two stages: 
sparse coding and dictionary update. See Algorithm [T] for 
the diagram of a typical dictionary learning procedure. In the 
sparse coding stage, the goal is to find a sparse X to minimize 
||y-£)X||^for a given dictionary D. In practice, the sparse 
coding problem is often approximately solved by using either 
£i -minimization |611 or greedy algorithms, for example, OMP 
Ea and SP ESI algorithms. 
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Algorithm 1 A typical dictionary learning algorithm 

Task: find the best dictionary to represent the data sample 

matrix Y. 

Initialization: Set the initial dictionary D'^^\ Set J = 1. 
Repeat until convergence (use stop rule): 

• Sparse coding stage: Fix the dictionary D^'^^ and update 

using some sparse coding technique. 

• Dictionary update stage: Update D'^'^\ and X^'^^ as 
appropriate. 

. J = J+l. 



The focus of this paper is on the dictionary update stage. 
There are different formulations for this stage, leading to 
substantially different algorithms. In the MOD L36J method, 
one fixes the sparse coding matrix X and searches for the 
optimal dictionary D, and hence essentially solves a least 
squares problemlJ By contrast, in the approach represented by 
the K-SVD method, one updates both the dictionary D and 
the nonzero coefficients in X. In particular, in each step of the 
dictionary update stage of the K-SVD algorithm, one updates 
one codeword of the dictionary D and the nonzero coefficients 
in the corresponding row of the matrix X. After sequentially 
updating all the codewords and their corresponding coeffi- 
cients, the only element fixed is the sparsity pattern, that is, 
the locations of the non-zeros in X. As has been demonstrated 
empirically in [lOJ . the K-SVD algorithm often enjoys faster 
convergence and produces a more accurate dictionary when 
compared with the MOD method. 

The key characteristic of our approach is to update all 
codewords and the corresponding non-zero coefficients simul- 
taneously. In our formulation, we assume that the dictionary 
matrix D contains unit £2-110™ columns and the sparsity 
pattern of X remains unchanged. More specifically, define 



v = {De 



D m X d 



\D..4^ = 1, Vi e [d]} 



(1) 



where ||-||2 is the ^2-norm and the set [d] = {1, 2, • • • , d}. The 
sparsity pattern of X is represented by the set il C [d] x [n] 
which contains the indices of all the non-zero entries in X: 
that is, Xi^j 7^ for all e ft and Xij = for all 

(i, j) ^ il. Define 



x{n) = {x e 



idxr 



(2) 



The dictionary update problem under consideration is given 
by 

inf inf \\Y-DX\\l. (3) 
Dev xex(n) ^ 

Note that the optimal X that minimizes \\Y — DX\fp varies 
as D changes. An update in D implies an update of the 
corresponding optimal X. Hence, both D and X are simulta- 
neously updated. We refer to this optimization framework as 
primitive SimCO. 

'when there are no constraints on the norm of the columns of D, 
minimizing \\Y — DX\\'^ for given Y and X is a standard least squares 
problem and admits a closed-fonn solution. When extra constraints on the 
column norm are imposed, as we shall show shortly, the optimization problem 
is a least squares problem on a product of manifolds. No closed-form solution 
has been found. 



Another optimization framework proposed in this paper is 
the so called regularized SimCO. The related optimization 
problem is given by 



inf inf 

Dev xex(n) 



\Y DX\ 



IXI 



(4) 



where /i > is a properly chosen constant. The motivation 
of introducing the regularization term ||X||^ is presented in 
Section [V] 

The ideas of SimCO can be generalized: instead of updating 
all codewords simultaneously, one can update an arbitrary 
subset of codewords and the corresponding coefficients. More 
precisely, let I C [d] be the index set of the codewords to 
be updated. That is, only codewords D i's, i E I, are to 
be updated while all other codewords D.j's, j ^ I, remain 
constant. Let D. x denote the sub-matrix of D formed by the 
columns of D indexed by I. Let Xx.-. denote the sub-matrix 
of X consisting of the rows of X indexed by I. Define 

Yr^Y -D.„x^-Xx^., 

where I'^ is a set complementary to I. Then Y — DX = 
Y,. — D-_xXx.:. Then the optimization problems in SimCO 
can be written as 

inf fx (D) , 
where the objective function fx (D) is given by 



fx [D) = inf 



\Yr~D,rX 



Xx,:: xex(n) 



(5) 



for primitive SimCO and 



inf iiy; 

Xx,:: xex(n) 



D..,xXx^\l + ^i\\Xx,\\l 



fx{D) 

Xx ■■■ xex(ii) 

(6) 

for regularized SimCO, respectively. The algorithmic details 
for solving primitive and regularized SimCO are presented in 
Sections |IV] and |V] respectively. 

The connection between our formulation and those in MOD 
and K-SVD is clear When sparse coefficients are fixed, 
the underlying optimization problem is similar to that in 
MOD. When only one codeword is selected for update, the 
formulation in (|5]l is identical to the optimization formulation 
treated in K-SVD. 

There are also fundamental differences between our frame- 
work and those in MOD and K-SVD. Compared with MOD, 
our formulation puts a constraint ([T]) on the ^2-norm of the 
columns of the dictionary matrix. This constraint is motivated 
by the following reasons. 

1) The performance of a given dictionary is invariant to the 
column norms. The performance of a given dictionary 
D is described by how the product DX approximates 
the training samples Y. By scaling the corresponding 
rows in X, one can keep the product DX invariant to 
any nonzero scaling of the columns in D. 

2) A normalized dictionary D E V is preferred in the 
sparse coding stage. Sparse coding algorithms rely 
heavily on the magnitudes of the coefficients X^ j's, 

e [d] X [n], which are affected by the column 
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norms of D. It is a standard practice to normalize the 
columns of D before applying sparse coding algorithms. 
3) A normalized dictionary £) e 2? is required in regular- 
ized SimCO. The regularization term /z is useful 
only when the column norms of D are fixed. To see 
this, let Di,D2 G M™^'^ be two dictionaries whose 
columns are only different in scaling; it can be shown 
that in this case the optimal X for the minimization of 
— DX\\^p + /i ||X||^ can be very different and so 
is the regularization term. 
More subtly, the singularity phenomenon that motivates reg- 
ularized SimCO depends upon the normalized columns. This 
point will be detailed in Section W\ 

Our formulation naturally accommodates an inclusion of 
the regularization term in (01). As will be shown in Sections 
rvl and IVIII the regularization term improves the learning 
performance significantly. Note that it is not clear how to 
extend MOD or K-SVD for the regularized case. In the 
dictionary update step of MOD, the coefficient matrix X is 
fixed. The regularization term becomes a constant and does 
not appear in the optimization problem. The main idea of K- 
SVD is to use SVD to solve the corresponding optimization 
problem. However, it is not clear how to employ SVD to solve 
the regularized optimization problem in (|6]l when \I\ = 1. 

III. Preliminaries on Manifolds 

Our approach for solving the optimization problem (O 
relies on the notion of Stiefel and Grassmann manifolds. In 
particular, the Stiefel manifold Um,i is defined as Um.i = 
{tt G K™ : u^u = l} • The Grassmann manifold Qm,i is de- 
fined as Qm^i = {span (it) : u G Um.i] ■ Here, the notations 
U„i,i and Qm.i follow from the convention in |62|, |63|. 
Note that each element in Um.i is a unit-norm vector while 
each element in Qm,i is a one-dimensional subspace in R™. 
For any given u G Um.i, it can generate a one-dimensional 
subspace G Gm,i- Meanwhile, any given G Gm,i can 
be generated from different u G Um.i'- if = span [u], then 
— span (—It) as well. 

With these definitions, the dictionary D can be interpreted 
as the Cartesian product of d many Stiefel manifolds Um,i- 
Each codeword (column) in D is one element in Um.i- It looks 
straightforward that optimization over D is an optimization 
over the product of Stiefel manifolds. 

What is not so obvious is that the optimization is actually 
over the product of Grassmann manifolds. For any given pair 
{D,X), if the signs of D- i and Xi.- change simultaneously, 
the value of the objective function — Z)X||^ stays the 
same. Let D = [D:^, • • • , £):,i_i, £):^i, • • • , r>:,d] 

and D' = [D,^i, • • • , D,^i^i, -D.i, D..^i+i, ■■ ■ , D..^d]- Then 
it is straightforward to verify that (D) = /[^j {D'). In 
other words, it does not matter what D- i is; what matters is 
the generated subspace span(D: i). As shall become explicit 
later, this phenomenon has significant impacts on algorithm 
design and analysis. 

It is worth noting that the performance of a given dictionary 
is invariant to the permutations of the codewords. However, 
how to effectively address this permutation invariance analyt- 
ically and algorithmically remains an open problem. 



IV. Implementation Details for Primitive SimCO 

This section presents the algorithmic details of primitive 
SimCO. For proof-of-concept, we use a simple gradient de- 
scent method. The gradient computation is detailed in Sub- 
section IIV-AI How to search on the manifold product space 
is specified in Subsection IIV-BI The overall procedure for 
dictionary update is described in Algorithm |2] Note that one 
may apply second-order optimization methods, for example, 
the trust region method f64l, for SimCO. The convergence 
rate is expected to be much faster than that of gradient descent 
methods. However, this is beyond the scope of this paper 



A. Gradient computation 

In this subsection, we compute the fx (D) in (|5]l and the 
corresponding gradient Wfx{D). 

The computation of fx{D) involves solving the corre- 
sponding least squares problem. For a given j G [n], let 
^{■ij) — {* • £ Similarly, we define ft{i,:) = 

{j ■ ihj) G Let Xxnn{:,j),j the sub-vector of X-j 
indexed hy I O ft (:, j), and D..xnn{:.j) be the sub-matrix of 
D composed on the columns indexed hy I D ft {:, j)- It is 
straightforward to verify that 



lYr - D:^xXxjfp — ^ ~ -C^: .InJ^: j) ^XnO(: j) j 

3 = 1 



and 



/x(£>)=X^ ^ inf (X-):.,-D.. 



,Inn(:j)^XnO(:j)j 



fx. AD) 



(7) 

Note that every atomic function fx.j (D) corresponds to a least 
squares problem of the form inf^, \\y — Ax\\'^p. The optimal 
X* admits the following closed-from 



X 



inn{:. j).j 



Xx^, 

, , Vj G [n] 



(8) 



where the superscript f denotes the pseudo-inverse of a matrix. 
In practice, X^^^^. ^ can be computed via low complexity 
methods, for example, the conjugate gradient method |65|, to 
avoid the more computationally expensive pseudo-inverse. 

The gradient of fx (D) is computed as follows. Let 
us consider a general least squares problem f^s (^) = 
infx \\y — Ax\\'^. Clearly the optimal x* = A'^y is a 
function of A. With slight abuse in notations, write fj^s (^) 
as fLs{A,x*). Then 



dfLs{A,x*) , dfLs{A,x*) dx* 



dA 

= -2{y- Ax*)x*^ 
= -2{y-Ax*)x*^, 



dx* 
dx* 

'lA 



dA 



(9) 



where the second equality holds because x* minimizes 
II y — Aa;*||2 and hence — 0. Based on (|9]l, the gradient 
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of fx (D), with respect to D i, i E I, can be computed via 

Vn. Ji (D) ^-2{Y- DX*),n(^,:) ^;,n(v) 

= -2{Y - DX*)X*J. (10) 

Here, (i, :) gives the columns of Y whose sparse represen- 
tation involves the codeword D i. 

When I = [d], the formulas for X* and Vd/ can be 
simplified to 

^kAi = Vje N, and 

Vofid] {D)^-2{Y -DX*)X*^. 

B. Line search along the gradient descent direction 

The line search mechanism used in this paper is significantly 
different from the standard one for the Euclidean space. In a 
standard line search algorithm, the k*^ iteration outputs an 
updated variable x'^^^ via 

a,W=a:('=-i)-t.V./(a;('=-i)), (11) 

where / [x) is the objective function to be minimized, and 
t e R+ is a properly chosen step size. However, a direct 
application of (fTTT i may result in a dictionary D ^ V. 

The line search path in this paper is restricted to the 
product of Grassmann manifolds. This is because, as has been 
discussed in Section |III1 the objective function fx is indeed 
a function on the product of Grassmann manifolds. On the 
Grassmann manifold Qm,i, the geodesic path plays the same 
role as the straight line in the Euclidean space: given any two 
distinct points on Gm,i, the shortest path that connects these 
two points is geodesic Il52l . In particular, let S Gm,i be a 
one-dimensional subspace and u e Um^i be the corresponding 
generator matrix (not unique) Consider a search direction 
h e R" with ||ft.||2 = 1 and h'^u ^ 0. Then the geodesic 
path starting from u along the direction h is given by [62] 

u{t) = u ■ cos t + h ■ sin t, t eM.. 

Note that u{t) = — m (t + tt) and hence span {u (t)) = 
span {u{t + tt)). In practice, one can restrict the search path 
within the interval t E [0,7r). 

For the dictionary update problem at hand, the line search 
path is defined as follows. Let Qi ~ Vd ./i(£)) be the 
gradient vector defined in (fTOl l. We define 

g^=g^- D .Dl^g,, V^ E I, (12) 

so that Qi and D- i are orthogonal. The line search path for 
dictionary update, say D [t), t > {), is given by ll62l 

D.,^,{t) = D.,^, a 1(^1 or ||g.||2 =0' 
D.i (t) = D■,^^cos{\\g^\\2t) - {§1/ \\g^\\^) sm {\\g^\\^ t) 
if i El and ||gi||2 7^ 0. 

(13) 

Algorithm |2] summarizes one iteration of the proposed line 
search algorithm. For proof-of-concept and implementation 
convenience, we use the method of golden section search 

^The generator matrix tt is a vector in this case. 



Algorithm 2 One iteration of the line search algorithm for 
dictionary update. 

Task: Use line search mechanism to update the dictionary D. 
Input: Y, D, X 
Output: D' and X' . 

Parameters: > 0: initial step size, gmin > 0: the threshold 
below which a gradient can be viewed as zero. 
Initialization: Let c = (\/5 - l) /2. 

1) Let ti =0. Compute f [D) using (I?) and the corre- 
sponding gradient gi on the Grassmann manifold using 
dnil and ([Toll. If \\gi\\2 < ffmin for all i E X, then 
D' = D, X' = X, and quit. 

2) Let ^ ct^ and t2 ^ [1 — c) ^4. 

Part A: the goal is to find ^4 > s.t. /(Z)(ti)) > 
f{D{t2)) > f{D{h)) < f{D{U)). Iterate the following 
steps. 

3) If / {D {ti)) < f{D {t2)), then ^4 = ^2, ^3 = ct^ and 
t2 = (1 - c) t4. 

4) Else if / (D {t2)) < f{D (tg)), then U ^ t^, h - ^2 
and i2 = (1 — c) ^4- 

5) Else if / (£> (is)) > f{D {U)), then t2 = h, - U 
and ^4 — t^/c. 

6) Otherwise, quit the iteration. 

Part B: the goal is to shrink the interval length ^4 — ti 
while trying to keep the relation f {D {ti)) > f {D {12)) > 
f {D (t^)). Iterate the following steps until ^4 — ti is suffi- 
ciently small. 

7) If / (D ih)) > f{D {t2)) > f{D then U = ^2, 
t2 = h and ^ ti + c {t^ ~ ti). 

8) Else ti = tg, is = t2 and ^2 = <i + (1 - c) [t^ - ti). 
Output: Let t* = arg min f {D (t)) and D' = D{t*). 

Compute X' according to (O. 



(see |'66l for a detailed description). The idea is to use the 
golden ratio to successively narrow the searching range of 
t inside which a local minimum exists. To implement this 
idea, we design a two-step procedure in Algorithmic in the 
first step (Part A), we increase/decrease the range of t, i.e., 
(0,^4), so that it contains a local minimum and the objective 
function looks unimodal in this range; in the second step (Part 
B), we use the golden ratio to narrow the range so that we 
can accurately locate the minimum. Note that the proposed 
algorithm is by no means optimized. Other ways to do a 
gradient descent efficiently can be found in 165j Chapter 3]. 

V. Implementation Details for Regularized SimCO 

As will be detailed in Section IVII-AI MOD, K-SVD and 
primitive SimCO may result in ill-conditioned dictionaries. 
Regularized SimCO method (|4|i is designed to mitigate this 
problem. 

The ill-condition of the dictionary can be described as fol- 
lows. Fix the sparsity pattern VI. The matrix -D. contains 
the codewords that are involved in representing the training 
sample Y-j. We say the dictionary D is ill-conditioned with 
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respect to the sparsity pattern fl if 



functions, i.e.. 



Amin (-D:^o(:,i)) ^ A,nax j)^ 



for some j g [n]. Here, Amin (•) and Amax (•) give the smallest 
and largest singular values of a matrix, respectively. 

The ill-condition of D brings two problems: 



/x(£»)-E 



inf 



(Yr).., 



D:^xnn{:,j)Xxnn(:.j).j 



(14) 



1) Slow convergence in the dictionary update stage. When functions (O Let 



A 



mill {D.,^n{:,j)j 



is close to zero, the curvature (Hes 



One needs to solve the least squares problems in atomic 

= \lnQ{:,j)\. It is clear that 



sian matrix) of fx (D) is large. The gradient changes 
significantly in the neighborhood of a singular point. 
Gradient descent algorithms typically suffer from a very 
slow convergence rate. 
2) Instability in the subsequent sparse coding stage. When 



Amin (-D:^n(: j) 

squares problem inf ||y^ 



is close to zero, the solution to the least 

1 1 2 

-D:,0(:j)^f2(:j)j||^ 

becomes unstable: small changes in Y^j often result in 
very different least squares solutions X^^, ^.^ It is well 
known that the stability of sparse coding relies on the 
so called restricted isometry condition (RIP) |61 1, which 
requires that the singular values of submatrices of D 
concentrate around 1. An ill-conditioned D violates RIP 
and hence results in sparse coefficients that are sensitive 
to noise. 

It is worth mentioning that the above discussion on the ill- 
condition problem depends upon the unit-norm columns. To 
see this, consider a dictionary with orthonormal columns. It 
is clearly well-conditioned. However, if one picks a column 
of the dictionary matrix and scales it arbitrarily small, the 
resulted dictionary will then become ill-conditioned. Hence, a 
constraint ^ on column norms is necessary for the discussion 
of the condition number of a dictionary. 

It is also worth mentioning the difference between a sta- 
tionary point and an ill-conditioned dictionary. In both cases, 
it is typical that the objective function stops decreasing as 
the number of iterations increases. It is therefore difficult 
to distinguish these two cases by looking at the objective 
function only. However, the difference becomes apparent by 
checking the gradient: the gradient is close to zero in the 
neighborhood of a stationary point while it becomes large 
in the neighborhood of a singular point. This phenomenon is 
not isolated as it was also observed in the manifold learning 
approach for the low-rank matrix completion problem [631. 

To mitigate the problem brought by ill-conditioned dictio- 
naries, we propose regularized SimCO in (|4|i. Note that when 
D is ill-conditioned, the optimal X* for the least squares 
problem in primitive SimCO is typically large. By adding the 
regularization term to the objective function, the search path 
is "pushed" towards a well-conditioned one. 

Algorithm |2] can be directly applied to regularized SimCO. 
The only required modifications are the computations of the 
new objective function (|6]l and the corresponding gradient. 
Similar to primitive SimCO, the objective function ^ in 
regularized SimCO can be decomposed into a sum of atomic 



D V e 



and Xxno(:j)j- £ 



Define 



and D j 



D 



ino(:j) 



y/Jl- I„ 



where 0,„ is the zero vector of length rrij, and /,„ is the 
rrij X TTij identity matrix. The optimal X'^^^^, ^.^ ^ to solve the 
least squares problem in (fl4] i is given by 



X* 



^xnn(:,j},j - ^jYrj. (15) 
The corresponding value of the objective function is therefore 



fxiD) = \\Yr-DX* 



(16) 



The gradient computation is similar to that for primitive 
SimCO. It can be verified that 

VD^,,/i(r>) = -2(F-DX*)X0. (17) 

Replacing (jT) and ([Toll in Algorithm |2] by (O and (fTTl i 
respectively, we obtain a gradient descent implementation for 
regularized SimCO. 

In practice, one may consider first using regularized SimCO 
to obtain a reasonably good dictionary and then employ 
primitive SimCO to refine the dictionary further This two-step 
procedure often results in a well-conditioned dictionary that 
fits the training data. Please see the simulation part (Section 
IVIIl l for an example. 

VI. Convergence of Primitive SimCO 

The focus of this section is on the convergence performance 
of primitive SimCO when the index set I contains only one 
index. The analysis of this case shows the close connection 
between primitive SimCO and K-SVD. More specifically, as 
we discussed in Section |II] when \I\ = 1, the optimization 
formulations of primitive SimCO and K-SVD are exactly the 
same. To solve this optimization problem, primitive SimCO 
uses a gradient descent algorithm while K-SVD employs 
singular value decomposition (SVD). In Theorem \T\ of this 
section, we shall prove that a gradient descent finds a global 
optimum with probability one. Hence, when \I\ — 1, the 
learning performance of primitive SimCO and K-SVD are the 
same. Note that, even though the general case when \I\ > 1 
is more interesting, its convergence is much more difficult to 
analyze. 

The analysis for the case of \I\ = 1 helps in understanding 
where the performance gain of SimCO comes from. Theorem[T] 
shows the equivalence between K-SVD and primitive SimCO 
when |I| = 1 in terms of where to converge. In terms of al- 
gorithmic implementation, K-SVD employs SVD which gives 
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the optimal solution without any iterations visible to users. 
As a comparison, our implementations of SimCO are built 
on gradient descent, which is well-known for its potentially 
slow convergence rate. Nevertheless, our numerical tests show 
similar convergence rates (similar number of iterations) for 
primitive SimCO and K-SVD. This implies that the flexibility 
of updating codewords simultaneously significantly reduces 
the number of iterations. 

When \I\ — 1, the rank-one matrix approximation problem 
arises in both primitive SimCO and K-SVD. Formally, let A E 
j^mxTi ^ matrix, where m > 1 and n > 1 are arbitrary 
positive integers. Without loss of generality, assume that ui ^ 
n. Suppose that the sorted singular values satisfy Ai > A2 > 
A3 > • • • > A„i. Define 

f (u) = min II A — Mty-^ll^ , Vtt G t/m 1. (18) 

The rank-one matrix approximation problem can be written as 
the following optimization problem 

min fiu). (19) 

We shall analyze the performance of gradient descent in the 
rank-one matrix approximation problem. To avoid numerical 
problems that may arise in practical implementations, we 
consider an ideal gradient descent procedure with infinitesimal 
step sizes. (Note that true gradient descent requires infinitesi- 
mal steps.) More specifically, let e be a positive number From 
a given starting point, one takes steps of size e along the 
negative gradient direction until the objective function stops 
decreasing. Letting e approach zero gives the ideal gradient 
descent procedure under consideration. 

The following theorem establishes that the described gradi- 
ent descent procedure finds the best rank-one approximation 
with probabihty one. 

Theorem 1. Consider a matrix A G R™^" and its singular 
value decomposition. Employ the gradient descent procedure 
with infinitesimal steps to solve M8h Suppose the starting 
point, denoted by Uq, is randomly generated from the uniform 
distribution on Um,i- Then the gradient descent procedure 
finds a global minimizer with probability one. 

The proof is detailed in Appendix |A] 

Remark 2. The notion of Grassmann manifold is essential 
in the proof. The reason is that the global minimizer is not 
unique: if u, G Um,i is a global minimizer, then so is —u. In 
other words, only the subspace spanned by a global minimizer 
is unique. 

Remark 3. According to the authors' knowledge, this is the 
first result showing that a gradient search on Grassmann 
manifold solves the rank-one matrix approximation problem. 
In literature, it has been shown that there are multiple station- 
ary points for rank-one matrix approximation problem 1 641 
Proposition 4.6.2]. Our results show that a gradient descent 
method will not converge to stationary points other than 
global minimizers. More recently, the rank-one decomposition 
problem where A2 = A3 = • • • = \„i ~ was studied in ||63l . 
Our proof technique is significantly different as the effects 



of the eigen-spaces corresponding to A2 , • • • , Am need to be 
considered for the rank-one approximation problem. 

VII. Empirical Tests 

In this section, we numerically test the proposed primitive 
and regularized SimCO. In the test of SimCO, all codewords 
are updated simultaneously, i.e., I = [d]. In Section [VII- A| we 
show that MOE0, K-SVD, and primitive SimCO may result 
in an ill-conditioned dictionary while regularized SimCO can 
mitigate this problem. Learning performance of synthetic and 
real data is presented in Sections IVII-BI and IVII-CI respec- 
tively. Running time comparison of different algorithms is 
conducted in Section IVII-DI Note that SimCO algorithms are 
implemented by using simple gradient descent method. Simu- 
lation results suggest that simultaneously updating codewords 
significantly speeds up the convergence and the regularization 
term substantially improves the learning performance. 

A. Ill-conditioned Dictionaries 

In this subsection, we handpick a particular example to 
show that MOD, K-SVD and primitive SimCO may converge 
to an ill-conditioned dictionary. In the example, the training 
samples Y G R^^^^'* are computed via Y = Z>truc^truc, 
where Aruc e Ri6x32 ^ x^,^^ g R32x78^ column 
of X contains exactly 4 nonzero components. We assume that 
the sparse coding stage is perfect, i.e., Otiuc is available. We 
start with a particular choice of the initial dictionary Dq G P. 
The regularization constant pi in regularized SimCO is set to 
pi = 0.01. 

The numerical results are presented in Figure [1] In the left 
sub-figure, we compare the learning performance in terms of 
||1^ — £)X||^. In the middle sub-figure, we study the behavior 
of the gradient V of [D] for different algorithms. In the right 
sub-figure, we depict the condition number of the dictionary 
defined as 

K (D) = max Amax (l?:,0(:j)) / Amin • 

Here, note that k (Dtruo) = 3.39. The results in Figure[T]show 
that 

1) When the number of iterations exceeds 50, MOD, K- 
SVD and primitive SimCO stop improving the training 
performance: the value of / decreases very slowly with 
further iterations. Surprisingly, the gradients in these 
methods do not converge to zero. This implies that these 
methods do not converge to local minimizers. A more 
careful study reveals that these algorithms converge to 
points where the curvature (Hessian) of the objective 
function / (D) is large: the gradient of the objective 
function Vd/ changes dramatically in a small neigh- 
borhood. 

'in the tested MOD, the columns in D are normaHzed after each dictionary 
update. This extra step is performed because many sparse coding algorithms 
requires normalized dictionary. Furthermore, our preliminary simulations (not 
shown in this paper) show that the performance of dictionary update could 
seriously deteriorate if the columns are not normalized. 
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Function values v.s. # of Iterations 



MOD 
■ K-SVD 

- Primitive SimCO 

d SimCO " 



Relative norm of ttie gradient v.s. # of Iterations 



MOD 
■ K-SVD 

- Primitive SimCO 

- Regularized SimCO 



Condition number v.s. # of Iteratioi 



MOD 
■ ■ K-SVD ' 

- Primitive SimCO 

- Regularized SimCO 



Figure 1 : Starting with the same point, the convergence behaviors of MOD, K-SVD, primitive SimCO and regularized SimCO 
are different. In this particular example, only regularized SimCO avoids converging to a singular point. 



2) The above phenomenon can be well explained by check- 
ing the ill-condition of the dictionary. After 100 itera- 
tions, the condition number n (D) remains large (> 10) 
for MOD, K-SVD, and primitive SimCO. 

3) By adding a regularized term and choosing the regular- 
ization constant properly, regularized SimCO avoids the 
convergence to an ill-conditioned dictionary. 

In fact, our simulations in Section IVII-BI show that the 
performance of primitive SimCO is not as good as other 
methods. We tracked all the simulated samples and found that 
it is because primitive SimCO may converge to a singular 
point very fast. Adding the regularization term significantly 
improves the performance (see Sections IVII-BI and IVII-CI ). 
The necessity of regularized SimCO is therefore clear. 



m=1 6, d=32, S=4, # of realization=200, # of iterations=400 





■□ ■ P , . rr 

■ ® - -o 0~ . - 

O- - e c 






MOD 




_Q_ 


K-SVD 




— *t — 


Primitive SImCO 






Regijiarized SimCO 





n : # of training sainpies 

(a) Noiseless case. 



m=1 6, d=32, S=4. # of reaiization=200, # of iterations=400 



B. Experiments on Synthetic Data 

The setting for synthetic data tests is summarized as follows. 
The training samples are generated via Y — Dtruc^truo- 
Here, the columns of .Dtruc are randomly generated from 
the uniform distribution on the Stiefel manifold Um,i- Each 
column of Xtmc contains exactly S many non-zeros: the 
position of the non-zeros are uniformly distributed on the set 

('s) ~ {{*!' ■ ' ' ' • 1 ^ *fe 7^ *f ^ d}; and the values of 
the non-zeros are standard Gaussian distributed. In the tests, 
we fix TO = 16, d = 32, and S — A, and change n, i.e., the 
number of training samples. Note that we intentionally choose 
n to be small, which corresponds to the challenging case. 

We first focus on the performance of dictionary update 
by assuming the true sparsity iluuc is available. Results 
are presented in Fig. |2] Note that the objective function of 
regularized SimCO is different from that of other methods. The 
ideal way to test regularized SimCO is to sequentially decrease 
the regularization constant /i to zero. In practice, we use the 
following simple strategy: the total number of iterations is set 
to 400; we change /i from le — 1 to le — 2, le — 3, and le — 4, 
for every 100 iterations. Simulations show that the average 
performance of regularized SimCO is consistently better than 
that of MOD and K-SVD. Note that there always exists a 
floor in reconstruction error that is proportional to noise. 
The normalized learning performance \\Y ~ DX\\'^p /n is 
presented in Figure |2] The average performance of regularized 
SimCO is consistendy better than that of MOD and K-SVD. 




10" I ' ' ' ' ' 1 

60 70 80 90 100 110 120 

n : # of training sampies 



(b) Noisy case: SNR of training samples is 20dB. Note that 
there always exists a floor in reconstruction en'or that is 
proportional to noise. 

Figure 2: Performance comparison of dictionary update (no 
sparse coding step). 

Then we evaluate the overall dictionary learning perfor- 
mance by combining the dictionary update and sparse coding 
stages. For sparse coding, we adopt the OMP algorithm |25| as 
it has been intensively used for testing the K-SVD method in 
fTO|, |67|. The overall dictionary learning procedure is given 
in Algorithm [T] We refer to the iterations between sparse 
coding and dictionary learning stages as outer-iterations, and 
the iterations within the dictionary update stage as inner- 
iterations. In our test, the number of outer-iterations is set 
to 50, and the number of inner-iterations of is set to 1. 
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Figure 3: Performance comparison of dictionary learning using 
OMP for sparse coding. 

Furthermore, in regularized SimCO, the regularized constant is 
set to /i = le — 1 during the first 30 outer-iterations, and /i = 
during the rest 20 outer-iterations. The normalized learning 
performance — DX\\^p /n is depicted in Figure |2] Again, 
the average performance of regularized SimCO is consistently 
better than that of other methods. 

Note that in the tests presented in this subsection, the 
performance of primitive SimCO is not as good as other 
methods. This motivates and justifies regularized SimCO. 

C. Numerical Results for Image Denoising 

As we mentioned in the introduction part, dictionary learn- 
ing methods have many applications. In this subsection, we 
look at one particular application, i.e., image denoising. Here, 
a corrupted image with noise was used to train the dictionary: 
we take 1,000 (significantly less than 65,000 used in ll67l ) 
blocks (of size 8 x 8) of the corrupted image as training 
samples. The number of codewords in the training dictionary 
is 256. For dictionary learning, we iterate the sparse coding 
and dictionary update stages for 10 times. The sparse coding 
stage is based on the OMP algorithm implemented in ll67l . 
In the dictionary update stage, different algorithms are tested. 
For regularized SimCO, the regularization constant is set to 
/i = 0.05. During each dictionary update stage, the line 
search procedure is only performed once. After the whole 
process of dictionary learning, we use the learned dictionary to 
reconstruct the image. The reconstruction results are presented 
in Fig. |4] While all dictionary learning methods significantly 
improves the image SNRs, the largest gain was obtained from 
regularized SimCO. 

D. Comments on the Running Time 

We compare the running time of different dictionary update 
algorithms in Table |T] It is empirically observed that SimCO 
runs faster than K-SVD but slower than MOD. The speed-up 
compared with K-SVD comes from the simultaneous update of 
codewords. That SimCO is slower than MOD is not surprising 
for the following reasons: MOD also updates all the codewords 
simultaneously; and MOD only requires solving least-squares 
problems, which are much simpler than the optimization 
problem in SimCO. 



Table I: Comparison of running time (in seconds) for dic- 
tionary learning. Note that sparse coding step is included in 
producing Fig. |3] and |4] 





MOD 


K-SVD 


Primitive 
SimCO 


Regularized 
SimCO 


Fig.[2la) 


2.4 X 10* 


2.0 X 10== 


5.1 X 10* 


4.0 X 10* 


Fig.lalb) 


2.3 X 10"* 


1.9 X 10^ 


5.0 X 10* 


4.0 X 10* 


Fig.l3j 


1.5 X 10* 


3.7 X 10* 


3.1 X 10* 


3.1 X 10* 


Fig.|4j 


1.42 


29.43 


2.63 


2.72 



VIII. Conclusions 

We have presented a new framework for dictionary update. 
It is based on optimization on manifolds and allows a si- 
multaneous update of all codewords and the corresponding 
coefficients. Two algorithms, primitive and regularized SimCO 
have been developed. On the theoretical aspect, we have 
established the equivalence between primitive SimCO and K- 
SVD when only one codeword update is considered. On the 
more practical side, numerical results are presented to show 
the good learning performance and fast running speed of 
regularized SimCO. 

Appendix 

A. Proof of Theorem |7] 

The following notations are repeatedly used in the 
proofs. Consider the singular value decomposition A = 
Ei'ii ^iUA,iV^^i, where Ai > A2 > • • ■ > \m > are 
the singular values, and UA,i and VA,i are the left and right 
singular vectors corresponding to Ai respectively. It is clear 
that the objective function / (tt) inft„gRn || A — tti(;-'"||^ 
has two global minimizers ±ua,i- For a given u £ Um.i, the 
angle between u and the closest global minimizer is defined 
as 

e = COS"^ \{u,UAs)\ ■ 

The crux of the proof is that along the gradient descent 
path, the angle is monotonically decreasing. Suppose that 
the starting angle is less than 7r/2. Then the only stationary 
points are when the angle is zero. Hence, the gradient descent 
search converges to a global minimizer The probability one 
part comes from that the starting angle equals to 7r/2 with 
probability zero. 

To formalize the idea, it is assumed that the starting point 
tto G Um,i is randomly generated from the uniform distribu- 
tion on the Stiefel manifold. Define a set S C Z^m,i to describe 
the set of "bad" starting points. It is defined by 

which contains all unit vectors that are orthogonal to ua,i- 
According to 1681 , under the uniform measure on Um.ii the 
measure of the set B is zero. As a result, the starting point 
Uq ^ B with probability one. The reason that we refer to B as 
the set of "bad" starting points is explained by the following 
lemma. 

Lemma 4. Starting from any Uq £ B, a gradient descent path 
stays in the set B. 
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Original clean image Noisy image, 20.1595dB Denolsed by MOD, 30.0979dB 




Denoised by KSVD, 30.7482dB Denolsed by Primitive SimCO, 30.9287dB Denolsed by Regularized SImCO, 30.9306dB 




Figure 4: Example of the image denoising using dictionary learning. PSNR values in dB are given in sub-figure titles. 



Proof: This lemma can be proved by computing the 
gradient of / at a m e B. Let Wu E M" be the op- 
timal solution of the least squares problem in / (u) = 



inf. 



It can be verified that 



and V/ = -2 (A - uwi 



V/ = -2(A 



uwT,) w. 



It is clear that 

= -2(A- uu' 



'A) 



= -2 ^ X'^UA.iU^.iU + 2u (u'^AA^u) 

i 

When Uq E B, it holds that {uo,ua.i) — and 
(V/ (uq) , ua.i) = 0. Since both Uq and the gradient descent 
direction are orthogonal to ma.i, the gradient descent path 
starting from Uq E B stays in B. ■ 
Now consider a starting points Uq ^ B. We shall show that 
the angle 6 is mono tonic ally decreasing along the gradient 
descent path. Towards this end, the notions of directional 
derivative play an important role. View 6* as a function of 
u E Um,i- The directional derivative of 6 si u E Um,i along 
a direction vector h E Hi'", denoted by WhO E M, is defined 

V.. = liin^(- + -^)-^(-). 

Note the relationship between directional derivative and gra- 
dient given by WhS ~ {W9,h). With this definition, the fol- 
lowing lemma plays the central role in establishing Theorem 

m 

Lemma 5. Consider a u E hlm,i such that {u) := 
cos"^ (|(m,ma,i)I) G (0,7r/2). Let hf = -W f {u) be the 
gradient of the objective function f at u. Then it holds 

Vh^e < 0. 

The proof of this lemma is detailed in Appendix |B] 



The implications of this lemma are twofold. First, it implies 
that hf = -V/ ^ for all u such that 9 (u) E (0,7r/2). 
Hence, the only possible stationary points in U„i,i\B are 
Ua.i and —ua,i- Second, starting from tto E B, the angle 
9 decreases along the gradient descent path. As a result, a 
gradient descent path will not enter B. It will converge to 
ua,i or —ua,i- Theorem [U is therefore proved. 

B. Proof of Lemma |5] 

This appendix is devoted to prove Lemma|5] i.e., 'Vhf9 < 0. 
Note that Vhj9 ^ {hf,V9) = (-V/, V6') = V-vef- It 
suffices to show that V -^gf < 0. 

Towards this end, the following definitions are useful. 
Define s = sign (m^ma.i)- Then the vector sua,i is one 
of the two global minimizers that is the closest to u. It can be 
also verified that 9 = cos^^ {u, sua.i)- Furthermore, suppose 
that e (0,7r/2). Define 

SUA 1 — U COS 9 U — SUA 1 COS 9 

hg = — ; — , and u±_ = - 



sin 9 

Clearly, vectors hg and u±^ are well-defined when 9 E 
(0,7r/2). The relationship among u, ua,i, hg and u±_ is 
illustrated in Figure |5] Intuitively, the vector hg is the tangent 
vector that pushes u towards the global minimizer sua,i- 

In the following, we show that V-xjgf — '^hef if we re- 
strict u E Um^i- By the definition of the directional derivative, 
one ha^ 

„ ,. w - eve* 

e^o \\u — eV9\\ 

'^The denominator comes from the restriction tliat u G lAm i ■ 
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span ([ma, 2, ■ ■ ■ ,ua.„:]) ^ 

Figure 5: Illustration of u, ua.i, hg and mj 



Note that 

V6I = V (cos-i (cos 6*)) 
1 

= =V (u. SUA i) = : — 

VI - cos2 9 ^ ' smt 

Since sua.i = u cos 9 + hg sin 9, one has 

e 



1 



u — eV9 = u 



sm I 



(sMA,l) 



= «(! + £ COS 9 1 sin 6*) + ehg . 

Substitute it back to V-ve^. One has V -'^gu — hg. In other 
words, if It e U,n,i, then V-^gf = Vhef- 

To compute Vfi^,/, note that / (m) = || A — = 
\\A\\^p — . Now define 

g{u) ^ \\u^A\\l. 

Then clearly Vh^f = —^he9- To proceed, we also de- 
compose A as follows. Recall the SVD of A given by 
^ = S"=i ^iUAaVA^i- Let C/a,± G Urn,m-i Contain the left 
singular vectors corresponding to A2, • • • , Am, i.e., Ua.± = 
[ua,2, ■ ■ ■ ,UA,m]- Similarly define Va.±- Then, 



A = [ma,1' t^A,±]diag ([Ai, • • • , A„]) 

= [ua.1,Ua,±] [wA.l,WA.±f , 



"^A.l 



where waa = Kvaa for i — I,-- - ,m, and Wa,± = 
['Wa,2,''' I'WA.m]- It is Straightforward to verify that 
w^Wa^± = 0. 

The function g {u) can be decomposed into two parts. Note 
that 



U [ma,1, C^A,±] [«^A,1, Wa, 

u'^ua,iWa^i\\1 + \\u'^Ua,±W'a^±\\1 
- 2 (m^wa.iioa.i, w^C^a,±WXj.) 

.T " 



U UasWa.i\ 



WuA^i^wi,_ 



I2 ' 



where the last equality follows from that W^^t^A = and 
hence 

{vFuaaw^^i,'u^Ua.i^Wa.a_) = 0. 

To further simplify g (u), note that cos^? = |m"^ma|- Further- 
more, it is straightforward to verify that the projection of u 



on span {Ua,±) is given by Ua.±U'^ = u± sm9. Define 
ur = W^.^u^ e Then, = 1 and 



\u'^Ua,i_WU\^ 

= smH\\ulUA^±Wl^\\l 

= sin^ eiM^diag ( [A2, . . . , A2 J ) ul. 



Hence, 



g (it) = cos^ 9- Xi+ sin^ eitt^diag ( [A^, • • • , A^] ) ur. 

It is now ready to decide the sign of Vhed- It is straight- 
forward to verify that 

Vhe COS 9 = lim / ^ ^ , SUA 1 ) = sin f 



and similarly Vhe sin 9 — - 



COS 6*. Therefore, 

u cos 9sua.i 

sin 9 



hg sin 61 + Mcos6' — sua,i 



sm 6 
SUA.l - suas 

sin^ 9 



and VhgUR = Vhe [Ua ±'^^J = ^- Hence, one has 
Vh«.g = sin 261 (Ai - M^Jdiag ([A^ • • • ,XI,])ur) 



Note that 



u^dmg {[Xl, ■ ■ ■ ,Xl^])uR 
<URdiag{[Xl--- ,Xl])uR^X2 < Ai. 



It can be concluded that when 9 e (0,7r/2), Vh,e5 > and 
^hgf = ^^hsd < 0. Lemma|5]is therefore proved. 
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